INFO370: Practice & Review Day

Carson G Miller Rigoli, Swastik Singh

2024-11-07

Introduction

Agenda

  • DAG Practice
  • General Practice

DAG Practice

First, we’ll start with an activity to practice making DAGs to represent causal structures.

Go to Canvas, and download the first Quarto file.

%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '30px', 'fontFamily': 'Inter'}}}%%

graph LR
            T(Treatment) --> M(Mediator)
            M --> O(Outcome)
          C(Confounder) --> T
          C --> O
          T --> X(Collider)
          O --> X
    
            style T fill:white, stroke-width:0px
            style O fill:white, stroke-width:0px
            style M fill:white, stroke-width:0px
            style C fill:white, stroke-width:0px
            style X fill:white, stroke-width:0px

Practice Tasks

Today we’ll practice a lot of skills from class!

There are 6 tasks. For each one, you should create a new Quarto document, load the required data, and accomplish the goals described for the task.

This document and the data files are all available on Canvas!

Formatting Reports

Complete each task in a new Quarto document.

Include a title and author, and set embed-resources: true in the YAML header.

Make the report readable with short explanatory text notes.

Create headers for sections of your analysis with titles like “Setup”, “Loading Data”, “Data Prep”, “Model Fitting”, “Predictions”, or similar.

Write a short summary (1-3 sentences) at the end that directly states the answer to the questions for the task. Include a header that says “Conclusions”.

Tasks and Skills

You can do the tasks out of order if you’d like! (4 and 5 are the longest)

  • Task 1: Data Wrangling
  • Task 2: Wrangling & Simple Linear Regression
  • Task 3: Confidence Intervals with Bootstrapping
  • Task 4: Linear Models for Causal Inference & DAGs
  • Task 5: Linear Models for Prediction
  • Task 6: Confidence Intervals & Choosing Summary Stats

Load Libraries

You’ll need tidyverse throughout.

library(tidyverse)  # also loads ggplot2

Task 1: Find the Least Homogeneous Region

National Survey on Drug Use & Health

  • Survey estimates how many people use different drugs (and other things) in the US.
  • Data is reported at the state level. You’ll characterize regional trends in US states.
  • Ages 12-17 is “youth”; Ages 18-25 is “young adult”; Ages 26+ is “older adult”
A map showing the United States split into the four CDC regions: West, South, Midwest, and Northeast

Load data

Note: Data only includes 2014. All numbers represent total number of individuals estimated in that state.

This code should be all you need to load the data.

Rows: 50
Columns: 18
$ State               <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Calif…
$ Region              <chr> "South", "West", "West", "South", "West", "West", …
$ Year                <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 20…
$ Population12_17     <int> 581887, 88028, 824776, 357677, 4716130, 625453, 45…
$ Population18_25     <int> 543988, 87646, 760569, 326239, 4577657, 598953, 37…
$ Population26plus    <int> 3119427, 430870, 4238637, 1888773, 24386075, 34043…
$ PainRelievers12_17  <int> 2100, 299, 2800, 1500, 15000, 2300, 1200, 300, 610…
$ PainRelievers18_25  <int> 5900, 800, 6200, 3100, 35899, 5400, 3300, 1000, 14…
$ PainRelievers26plus <int> 11800, 1500, 16900, 6600, 86800, 14000, 7000, 2100…
$ Alcohol12_17        <int> 1100, 100, 1800, 600, 8500, 1400, 800, 200, 3900, …
$ Alcohol18_25        <int> 5700, 1099, 9600, 3300, 58199, 8100, 5400, 1400, 2…
$ Alcohol26plus       <int> 16400, 2700, 30399, 8900, 146000, 23400, 14400, 32…
$ Cocaine12_17        <int> 200, 0, 500, 100, 2800, 400, 200, 0, 1000, 400, 0,…
$ Cocaine18_25        <int> 1700, 300, 4600, 800, 27300, 4200, 2500, 500, 1100…
$ Cocaine26plus       <int> 3099, 500, 7200, 1400, 38200, 6700, 3899, 900, 211…
$ Marijuana12_17      <int> 3799, 1000, 8200, 3000, 46300, 8500, 4500, 1000, 1…
$ Marijuana18.25      <int> 14500, 3000, 23100, 8500, 150600, 25300, 14799, 39…
$ Marijuana26plus     <int> 22100, 7300, 43900, 16200, 266400, 57100, 23400, 6…

Your Goals

It may be the case that different regions in the US have different youth drinking “cultures”. But just because the CDC defines a region as a group of states doesn’t mean that the pre-defined regions are all homogeneous (that is, internally similar).

Use the data provided to find the US region (West, Midwest, South, Northeast) that has the most diversity across its states in terms of youth drinking.

  • Create a visualization that compares the regions
  • Create an output that quantifies the diversity in youth drinking in each region, and orders the regions from most diverse to least.

Note:

  • Diversity in numeric variables can be quantified using standard deviation, interquartile range, or range.
  • States have different populations; to compare trends across states, you will need to look at proportions or percentages rather than totals.
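One way to structure this computation is sketched below on a tiny made-up stand-in for the Canvas data (the real data frame and its name will differ; `drug_toy` and its values here are illustrative only):

```r
library(tidyverse)

# Made-up stand-in for the NSDUH data -- replace with the real data from Canvas
drug_toy <- tibble(
  State           = c("A", "B", "C", "D"),
  Region          = c("West", "West", "South", "South"),
  Alcohol12_17    = c(1100, 100, 600, 1400),
  Population12_17 = c(581887, 88028, 357677, 625453)
)

region_spread <- drug_toy |>
  mutate(youth_drink_prop = Alcohol12_17 / Population12_17) |>  # per-capita rate
  group_by(Region) |>
  summarize(
    mean_prop  = mean(youth_drink_prop),
    sd_prop    = sd(youth_drink_prop),
    iqr_prop   = IQR(youth_drink_prop),
    range_prop = max(youth_drink_prop) - min(youth_drink_prop)
  ) |>
  arrange(desc(sd_prop))  # most diverse region first
```

Any of the three spread measures (sd, IQR, range) is a defensible ordering criterion; the sketch sorts by standard deviation.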

Task 1 Code

# A tibble: 4 × 5
  Region    mean_prop  sd_prop iqr_prop range_prop
  <chr>         <dbl>    <dbl>    <dbl>      <dbl>
1 West        0.00202 0.000499 0.000572   0.00180 
2 Northeast   0.00195 0.000346 0.000465   0.00101 
3 Midwest     0.00181 0.000297 0.000375   0.000944
4 South       0.00172 0.000139 0.000216   0.000478

Task 2: Quantify Relationship Between Youth Drinking and Adult Drinking

National Survey on Drug Use & Health

  • Survey estimates how many people use different drugs (and other things) in the US.
  • Data is reported at the state level. You’ll characterize regional trends in US states.
  • Ages 12-17 is “youth”; Ages 18-25 is “young adult”; Ages 26+ is “older adult”
A map showing the United States split into the four CDC regions: West, South, Midwest, and Northeast

Load data

Note: Data only includes 2014. All numbers represent total number of individuals estimated in that state.

This is the same data as from Task 1


Your Goals

Across the US in 2014, did states where lots of youth use marijuana tend to be states where lots of adults use marijuana? Or was the opposite the case?

  • Provide a visualization that lets you informally evaluate the question.
  • Formally quantify your answer with a single number. How much do youth marijuana use rates for states tend to be higher or lower based on adult marijuana use rates?

Note:

  • In this case, adult refers to all residents aged 18 and up.
  • This question is descriptive and non-inferential (we have data from all states, so we don’t need confidence intervals)
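A sketch of one approach, using a toy stand-in for the survey data (values below are illustrative; note the oddly named Marijuana18.25 column needs backticks in tidyverse code). Part of the task is deciding whether a correlation or a regression slope best answers “how much higher or lower”:

```r
library(tidyverse)

# Toy stand-in with the relevant columns -- use the real Task 1 data in practice
drug_toy <- tibble(
  Marijuana12_17   = c(3799, 1000, 8200, 3000),
  Population12_17  = c(581887, 88028, 824776, 357677),
  `Marijuana18.25` = c(14500, 3000, 23100, 8500),
  Marijuana26plus  = c(22100, 7300, 43900, 16200),
  Population18_25  = c(543988, 87646, 760569, 326239),
  Population26plus = c(3119427, 430870, 4238637, 1888773)
)

rates <- drug_toy |>
  mutate(
    youth_rate = Marijuana12_17 / Population12_17,
    # "adult" here means everyone 18 and up, per the note
    adult_rate = (`Marijuana18.25` + Marijuana26plus) /
                 (Population18_25 + Population26plus)
  )

# Informal: a scatterplot of adult_rate vs youth_rate with geom_point()
# Formal single number: a correlation, or a slope from lm(youth_rate ~ adult_rate)
cor(rates$youth_rate, rates$adult_rate)
```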

Task 2 Code

[1] 0.855969

Task 3: Test for Differences in Income

Fictional Income Data

Here, we’ll consider some (not super realistic) synthetic data! Imagine it is a representative sample of working adults who live in the fictional country of Datania.

Imagine that you work with the Datania government, which is building a phone app to help low-income residents easily access information about government financial support services. Datania is going to prototype the app, but can only afford to build the prototype for one mobile phone system first (iPhone or Android, in this case). If you want to make sure the app is prototyped with people who are likely to use it, does it matter which system you try out first?

Load Data

Note: Incomes are reported as percentile ranks, but you can treat them like any other income measure. 100 means you have the highest income in the country; 0 means you have the lowest. You can say something like “Income scores are 20 points higher” to describe the units.

Rows: 2,400
Columns: 4
$ income_rank            <int> 27, 4, 63, 73, 1, 80, 96, 59, 89, 80, 56, 27, 8…
$ higher_ed              <chr> "No", "No", "No", "Yes", "No", "Yes", "Yes", "Y…
$ iphone                 <chr> "No", "No", "Yes", "No", "No", "Yes", "Yes", "Y…
$ parent_max_income_rank <int> 45, 25, 78, 74, 23, 67, 85, 49, 67, 70, 80, 46,…

Your Goals

Based on the sample of residents in your data, estimate the difference in incomes for people with iPhones vs those without iPhones. Decide if you have evidence to support the idea that iPhones are a better or worse choice to prototype this system.

Note:

  • You only have a sample, so you will need to calculate a 95% confidence interval.
  • You will need to decide what summary statistic makes the most sense here.
  • This is not a causal analysis.
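A percentile-bootstrap sketch, shown on a toy stand-in for the Datania sample (the real file has 2,400 rows). It uses the mean for illustration; deciding whether the mean is actually the right summary statistic is part of the task:

```r
library(tidyverse)
set.seed(370)

# Toy stand-in; load the real Canvas data in practice
income_toy <- tibble(
  income_rank = c(27, 4, 63, 73, 1, 80, 96, 59, 89, 80, 56, 27),
  iphone      = c("No", "No", "Yes", "No", "No", "Yes",
                  "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)

# Summary statistic: difference in mean income between iPhone owners and others
diff_means <- function(d) {
  mean(d$income_rank[d$iphone == "Yes"]) - mean(d$income_rank[d$iphone == "No"])
}

# Resample rows with replacement many times, recomputing the statistic each time
boot_diffs <- replicate(
  2000,
  diff_means(slice_sample(income_toy, prop = 1, replace = TRUE))
)

# na.rm guards against rare resamples where one group is empty
quantile(boot_diffs, c(0.025, 0.975), na.rm = TRUE)  # 95% CI
```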

Task 3 Code

$mean_difference
[1] 39.84371

$ci_lower
[1] 38.18741

$ci_upper
[1] 41.50002

Task 4: Estimate Effect of Higher Education on Income

Higher Education and Income

Reuse the Datania data to estimate the average treatment effect of getting higher education on incomes! Does getting a college degree in Datania let you earn more money?

This is a causal analysis.

Load Data

Use the same data as in Task 3!

Rows: 2,400
Columns: 4
$ income_rank            <int> 27, 4, 63, 73, 1, 80, 96, 59, 89, 80, 56, 27, 8…
$ higher_ed              <chr> "No", "No", "No", "Yes", "No", "Yes", "Yes", "Y…
$ iphone                 <chr> "No", "No", "Yes", "No", "No", "Yes", "Yes", "Y…
$ parent_max_income_rank <int> 45, 25, 78, 74, 23, 67, 85, 49, 67, 70, 80, 46,…

Your Goals

Do your best job at estimating the average treatment effect of higher education on income given the available data. How much do you expect income to go up (in points) for people who get a higher education?

  1. Create a DAG to reason about how to estimate the average treatment effect of higher education on income.
    • You can include variables that aren’t recorded in the data, or just the ones that are.
  2. Use linear regression to estimate, as best you can with the available data, the average treatment effect of higher education on income. Report this as a 95% confidence interval.

Some useful things to consider:

  • Control for confounders. Don’t control for mediators or colliders!
  • confint() shows confidence intervals for models
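For example, if your DAG marks parent income as a confounder (and the phone variable as downstream of income, so it should not be controlled for), one plausible specification might look like the sketch below, shown on a made-up stand-in for the real income_dat:

```r
# Made-up stand-in with the same columns as the real income_dat
income_toy <- data.frame(
  parent_max_income_rank = c(45, 25, 78, 74, 23, 67, 85, 49),
  higher_ed              = c("No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"),
  income_rank            = c(27, 4, 73, 80, 1, 80, 96, 59)
)

# Control for the confounder; leave mediators and colliders out of the formula
ate_model <- lm(income_rank ~ higher_ed + parent_max_income_rank,
                data = income_toy)

# Note: R names the coefficient higher_edYes (variable name + factor level)
confint(ate_model, "higher_edYes", level = 0.95)
```

Whether parent income is the right (or only) control depends on the DAG you draw in step 1.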

Task 4 DAG

%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '30px', 'fontFamily': 'Inter'}}}%%

graph LR
  HE(Higher Education) --> I(Income)
  A(Age) --> HE
  A --> I
  E(Experience) --> I
  E --> HE
  F(Family Background) --> HE
  F --> I

  style HE fill:white, stroke-width:0px
  style I fill:white, stroke-width:0px
  style A fill:white, stroke-width:0px
  style E fill:white, stroke-width:0px
  style F fill:white, stroke-width:0px

Task 4 Code


Call:
lm(formula = income_rank ~ higher_ed, data = income_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-58.881 -13.135  -0.135  13.119  49.865 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.1354     0.5238   46.08   <2e-16 ***
higher_edYes  46.7457     0.7039   66.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.14 on 2398 degrees of freedom
Multiple R-squared:  0.6478,    Adjusted R-squared:  0.6477 
F-statistic:  4411 on 1 and 2398 DF,  p-value: < 2.2e-16
          2.5 % 97.5 %
higher_ed    NA     NA
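The NA row above comes from asking confint() for a parameter named higher_ed, while the fitted coefficient is actually named higher_edYes (R appends the factor level to the variable name). A corrected call, sketched on a toy stand-in since the real income_dat isn’t reproduced here:

```r
# Toy stand-in with the same structure as income_dat
income_toy <- data.frame(
  income_rank = c(20, 25, 30, 28, 70, 75, 68, 72),
  higher_ed   = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
)
toy_model <- lm(income_rank ~ higher_ed, data = income_toy)

# Use the coefficient's actual name -- otherwise confint() returns NA
confint(toy_model, "higher_edYes", level = 0.95)
```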

DAGs

To create a DAG, you can use mermaid. You can copy this starter example.

%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '30px', 'fontFamily': 'Inter'}}}%%

graph LR
            I(Impact) --> F(Fracture Size)
            F --> R(Recovery Time)
          P(General Health) --> I
          P --> R
    
            style I fill:white, stroke-width:0px
            style F fill:white, stroke-width:0px
            style R fill:white, stroke-width:0px
            style P fill:white, stroke-width:0px

Task 5: Predict Birthweights

Best Guesses for Birthweights

Imagine you work at an (imaginary and irresponsible) hospital where a technical error meant that the birth weights of 15 babies born yesterday were not recorded. You do have other information about those babies, and you want to create a model that lets you predict, or fill in, the missing information for those babies.

You should build this model and make predictions.

Later on, you find the missing birth weights. You should evaluate your model by seeing how far off your predictions were from the true weights.

Missing Record Babies

All of the babies were healthy and born full-term (between 37 and 41 weeks).

You have the length of each pregnancy in weeks (combgest), the age of the mother (mager), and whether the baby was a twin or a single child.

You should only use the found column after you’ve made your predictions.

Rows: 15
Columns: 5
$ name     <chr> "Alex", "Taylor", "Jordan", "Wei", "Noor", "Amal", "Hyeon", "…
$ mager    <int> 31, 37, 29, 31, 35, 35, 27, 24, 36, 32, 22, 28, 23, 35, 23
$ dplural  <chr> "Single", "Single", "Single", "Single", "Twin", "Twin", "Sing…
$ combgest <int> 39, 38, 38, 40, 38, 38, 39, 39, 40, 37, 37, 39, 41, 39, 40
$ found    <int> 3705, 3480, 3110, 3380, 2755, 2850, 3402, 3505, 4593, 3465, 3…

Load Training Data

This data includes a random sample of 10,000 US live births.

Use it to fit your birth-weight model.

  • combgest is weeks of pregnancy
  • mager is the age of the mother at birth
  • dbwt is weight of the newborn at date of birth in grams
Rows: 10,000
Columns: 38
$ X         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ dbwt      <int> 2930, 3060, 3333, 1786, 3660, 3350, 3210, 3600, 3543, 2995, …
$ bwtr4     <chr> "Normal", "Normal", "Normal", "LBW", "Normal", "Normal", "No…
$ mracehisp <chr> "NH White", "NH Black", "Hispanic", "NH White", "Hispanic", …
$ fracehisp <chr> "NH White", "NH Black", "Hispanic", "NH White", "Hispanic", …
$ dmar      <chr> "Married", "Married", "Married", "Married", NA, "Married", "…
$ meduc     <chr> "Adv Degree", "Bachelor", "<HS", "<HS", "HS", "Adv Degree", …
$ feduc     <chr> "Bachelor", "HS", "<HS", "<HS", "HS", "HS", NA, "Bachelor", …
$ mager     <int> 30, 32, 24, 31, 23, 33, 31, 28, 26, 26, 26, 27, 22, 43, 22, …
$ fagecomb  <int> 30, 29, 26, 35, 30, 36, NA, 35, 26, 28, 25, NA, 30, 30, NA, …
$ priorlive <int> 0, 1, 0, 1, 1, 2, 1, 0, 0, 0, 1, 2, 0, 2, 5, 1, 1, 0, 0, 3, …
$ priordead <int> 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ priorterm <int> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 3, 0, 1, 0, 0, 0, 1, 0, 0, …
$ lbo_rec   <int> 1, 2, 1, NA, 2, 3, 2, 1, 1, 1, 2, 3, 1, 3, 6, 2, 2, 1, 1, 4,…
$ tbo_rec   <int> 2, 3, 1, NA, 2, 4, 2, 1, 1, 1, 3, 6, 1, 4, 6, 2, 2, 2, 1, 4,…
$ illb_r    <int> NA, 139, NA, NA, 52, 25, 40, NA, NA, NA, 96, 115, NA, 106, 1…
$ precare   <chr> "2", "2", "None", "3", "2", "3", "3", "1", "2", "1", NA, "2"…
$ wic       <chr> "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No…
$ cig_rec   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ bmi       <dbl> 28.3, 40.3, 24.6, 35.4, 30.0, 22.8, 19.6, 22.0, 26.5, 17.0, …
$ rf_gdiab  <chr> "No", "No", "No", "No", "Yes", "No", "No", "No", "No", "No",…
$ rf_phype  <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "No",…
$ rf_ghype  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ rf_ppterm <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ rf_inftr  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ rf_cesar  <chr> "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"…
$ risks     <chr> "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No…
$ ld_indl   <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "No",…
$ ld_augm   <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No"…
$ ld_ster   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ ld_antb   <chr> "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No…
$ ld_anes   <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", …
$ apgar5    <int> 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 9, 9, 9, 9, 9, 9, 10, 9, 9, 8,…
$ dplural   <chr> "Single", "Single", "Single", "Single", "Single", "Single", …
$ sex       <chr> "Female", "Male", "Female", "Male", "Male", "Female", "Femal…
$ combgest  <int> 41, 39, 41, 38, 39, 38, 40, 40, 39, 35, 41, 38, 38, 38, 39, …
$ preterm   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes",…
$ ab_nicu   <chr> "No", "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes"…

Goals

Use a model to make predictions for the babies and check how well it did; find the baby whose weight you did the worst job of predicting! Make sure to:

  • Create a model to predict the birth weights of the 15 babies with lost records.
    • Build the model only with data from babies that were born full term (37 to 41 weeks), excluding babies that were triplets or quadruplets.
  • Predict the weight for the missing records and display the predictions for each baby.
  • Calculate how far each prediction was from the true weight
  • Display a table that ranks your predictions from worst to best across the babies

Note:

  • You could calculate predictions manually or use predict().
  • You’ll need to save your predictions and then subtract the found values
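Putting the notes together, here is a sketch with tiny made-up stand-ins for the real births and lost-records data frames (names and values below are illustrative only):

```r
library(tidyverse)

# Tiny stand-ins; replace with the real Canvas data
births_toy <- tibble(
  dbwt     = c(2930, 3060, 3333, 3660, 3350, 3210, 3600, 2755),
  mager    = c(30, 32, 24, 23, 33, 31, 28, 35),
  combgest = c(41, 39, 41, 39, 38, 40, 40, 38),
  dplural  = c("Single", "Single", "Single", "Single",
               "Single", "Single", "Single", "Twin")
)
lost_toy <- tibble(
  name     = c("Alex", "Noor"),
  mager    = c(31, 35),
  combgest = c(39, 38),
  dplural  = c("Single", "Twin"),
  found    = c(3705, 2755)
)

# Train only on full-term singles and twins
train <- births_toy |>
  filter(combgest >= 37, combgest <= 41, dplural %in% c("Single", "Twin"))

bw_model <- lm(dbwt ~ combgest + mager + dplural, data = train)

# Predict first; only then compare against `found`
results <- lost_toy |>
  mutate(pred      = predict(bw_model, newdata = lost_toy),
         abs_error = abs(pred - found)) |>
  arrange(desc(abs_error))  # worst prediction first
```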

Task 6: Compare Birthweight to “Typical”

Heavy or Light Babies

Imagine that you have two friends who have children. Both believe that one of their children was born particularly heavy relative to babies born in similar situations. You want to test this and decide whether their babies were born on the heavier or lighter side of comparable babies.

In this case, “comparable” is going to refer to other babies that were born healthy, had mothers who were a similar age and who were pregnant for the same length of time. Also, twins should only be compared with twins and single babies with single babies.

Load Data

Here is info about your friends’ babies.

  name dplural mager combgest dbwt
1 Alex  Single    35       40 3540
2  Sam    Twin    28       39 3000

Load Data

Reuse the births data from Task 5.

  • combgest is weeks of pregnancy
  • mager is the age of the mother at birth
  • dbwt is weight of the newborn at date of birth in grams

Goals

For each baby, decide if you can find evidence that it was a “large baby” relative to other babies that were born under similar circumstances (similar mother age, pregnancy time, and plurality).

For each baby:

  • Create a 95% confidence interval for a statistic that would let you know what the most “typical” baby weight is for babies that are comparable to your friend’s child.
  • Decide if you can confidently say that the baby is above or below the most typical baby.
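One way to approach Alex is sketched below, using a made-up vector standing in for the weights of the comparable babies (in practice, you would filter the births data on dplural, combgest, and a window of mager values; the median is used here as one reasonable choice of “most typical” statistic):

```r
set.seed(370)

# Made-up stand-in for dbwt values of babies comparable to Alex
comparable_dbwt <- c(3400, 3550, 3300, 3620, 3480, 3510, 3390, 3450, 3560, 3420)

# Percentile bootstrap CI for the median ("most typical") weight
boot_medians <- replicate(2000, median(sample(comparable_dbwt, replace = TRUE)))
ci <- quantile(boot_medians, c(0.025, 0.975))
ci

# Alex's 3540 g counts as confidently "heavy" only if it sits above ci[2]
```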

Advanced Task: LAD Regression

Heavy or Light Babies

Redo Task 6, but instead of estimating statistics for two subgroups of babies and comparing the estimates to Sam and Alex’s weights, build a model.

LS regression uses lm() and essentially tells you how the mean value of a group depends on predictors. LAD regression uses lad() and essentially tells you how the median value of a group depends on predictors. When you make a prediction with LAD, you are basically saying that the median value would be that number. To use lad(), you’ll need the L1pack library.
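A minimal sketch of the lad() workflow, assuming L1pack is installed, on a made-up data frame standing in for the filtered births data:

```r
# install.packages("L1pack")  # one-time, if needed
library(L1pack)

# Made-up stand-in for the full-term training data
births_toy <- data.frame(
  dbwt     = c(2930, 3060, 3333, 3660, 3350, 3210, 3600, 3543, 2995, 3480),
  mager    = c(30, 32, 24, 23, 33, 31, 28, 26, 26, 37),
  combgest = c(41, 39, 41, 39, 38, 40, 40, 39, 38, 38)
)

# Same formula interface as lm(), but the fit describes the conditional
# median of dbwt rather than the conditional mean
lad_fit <- lad(dbwt ~ mager + combgest, data = births_toy)
coef(lad_fit)
```

If a predict() method isn’t available for lad objects in your version of L1pack, you can build predictions directly from coef(lad_fit).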

Get to it!

Download the Files

Go to Canvas and download the folder with data and this file!